

Search for: All records

Creators/Authors contains: "Bier, Norman"


  1. Exposing students to low-quality assessments such as poorly written multiple-choice questions (MCQs) and short-answer questions (SAQs) is detrimental to their learning, making it essential to evaluate these assessments accurately. Existing evaluation methods are often difficult to scale and fail to consider an assessment's pedagogical value within course materials. Online crowds offer a scalable and cost-effective source of intelligence but often lack the necessary domain expertise. Large Language Models (LLMs) offer automation and scalability but may also lack precise domain knowledge. To explore these trade-offs, we compared the effectiveness and reliability of crowdsourced and LLM-based methods for assessing the quality of 30 MCQs and SAQs across six educational domains, using two standardized evaluation rubrics. We analyzed the performance of 84 crowdworkers from Amazon Mechanical Turk and Prolific, comparing their quality evaluations to those made by three LLMs: GPT-4, Gemini 1.5 Pro, and Claude 3 Opus. We found that crowdworkers on Prolific consistently delivered the highest-quality assessments and that GPT-4 was the most effective LLM for this task. While traditional crowdsourced methods often yield more accurate assessments, LLMs can match this accuracy on specific evaluative criteria. These results support a hybrid approach to educational content evaluation that integrates the scalability of AI with the nuanced judgment of humans, and we offer feasibility considerations for using AI to supplement human judgment in educational assessment. A minimal sketch of rubric-based LLM scoring in this spirit appears after this list.
  2. Free, publicly-accessible full text available March 3, 2026
  3. A model that maps the requisite skills, or knowledge components, to the contents of an online course is necessary to implement many adaptive learning technologies. However, developing a skill model and tagging courseware contents with individual skills can be expensive and error-prone. We propose Smart (Skill Model mining with Automated detection of Resemblance among Texts), a technology that automatically identifies latent skills from instructional text in existing online courseware. Smart can mine, label, and map skills without using an existing skill model or student learning (i.e., response) data. The goal of our approach is to mine latent skills from the assessment items included in existing courseware, give the discovered skills human-friendly labels, and map didactic paragraph texts to skills; in this way, a mapping between assessment items and paragraph texts is formed. The automated skill models produced by Smart thus reduce the workload of courseware developers while enabling adaptive online content from the launch of a course. In our evaluation study, we applied Smart to two existing authentic online courses and compared machine-generated and human-crafted skill models in terms of how accurately they predict students' learning; we also evaluated the similarity between machine-generated and human-crafted skill models. The results show that student models based on Smart-generated skill models were as predictive of students' learning as those based on human-crafted skill models, as validated on two OLI (Open Learning Initiative) courses. Smart can also generate skill models that are highly similar to human-crafted models, as evidenced by normalized mutual information (NMI) values. A sketch of this kind of text-clustering and NMI comparison appears after this list.
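The following is a minimal sketch of the kind of rubric-based LLM evaluation described in record 1. The paper's exact rubrics and prompts are not reproduced here; the three criteria, the prompt wording, and the example question are illustrative assumptions, and the OpenAI Python SDK is used only because GPT-4 is one of the models the study names.

```python
# Illustrative sketch only: rubric criteria, prompt, and example question are
# assumptions, not the study's actual instruments.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = """Rate the question on each criterion from 1 (poor) to 5 (excellent):
1. Clarity: the stem is unambiguous and grammatically sound.
2. Alignment: the question assesses the stated learning objective.
3. Distractors: incorrect options are plausible but clearly wrong."""

def rate_mcq(stem: str, options: list[str], objective: str) -> str:
    """Ask GPT-4 to score one multiple-choice question against the rubric."""
    question_block = stem + "\n" + "\n".join(f"- {o}" for o in options)
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic scoring eases reliability comparisons
        messages=[
            {"role": "system", "content": "You are an expert assessment reviewer."},
            {"role": "user", "content": f"Learning objective: {objective}\n\n"
                                        f"Question:\n{question_block}\n\n{RUBRIC}\n"
                                        "Return one line per criterion: name, score, brief reason."},
        ],
    )
    return response.choices[0].message.content

print(rate_mcq(
    stem="Which gas do plants primarily absorb during photosynthesis?",
    options=["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"],
    objective="Explain the inputs and outputs of photosynthesis.",
))
```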
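The following sketch illustrates the general shape of the pipeline described in record 3, not Smart's actual algorithm: assessment-item text is clustered into latent "skill" groups, and the machine-generated grouping is compared to a hypothetical human-crafted skill model with normalized mutual information (NMI), the metric the abstract mentions. The items, skill labels, and choice of TF-IDF plus k-means are illustrative assumptions; scikit-learn is used for convenience.

```python
# Illustrative sketch only: Smart's mining and labeling method is not shown here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical assessment items and the skills a human expert assigned to them.
items = [
    "Compute the sample mean of the data set.",
    "Find the median of the following values.",
    "State the null hypothesis for this experiment.",
    "Compute the p-value for the given test statistic.",
]
human_skills = ["central-tendency", "central-tendency",
                "hypothesis-testing", "hypothesis-testing"]

# Mine latent skills: vectorize the item text, then cluster into k skill groups.
vectors = TfidfVectorizer(stop_words="english").fit_transform(items)
machine_skills = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Compare the machine-generated skill model to the human-crafted one.
# NMI is 1.0 when the two groupings agree perfectly (up to label names).
print("NMI:", normalized_mutual_info_score(human_skills, machine_skills))
```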